Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

نویسندگان

Domenico Cantone

Simone Faro

Emanuele Giaquinta

چکیده

In this paper we propose an efficient approach to the compressed string matching problem on Huffman encoded texts, based on the Boyer-Moore strategy. Once a candidate valid shift has been located, a subsequent verification phase checks whether the shift is codeword aligned by taking advantage of the skeleton tree data structure. Our approach leads to algorithms that exhibit a sublinear behavior on the average, as shown by extensive experimentation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Very fast pattern matching for highly repetitive text

This paper describes two searching methods for locating longest string matches in source texts of low entropy. A modi cation of the Boyer-Moore scanning algorithm and a statistical method, which searches for less likely symbols, are presented. Both algorithms have been implemented as part of the searching strategy for an LZ77 type encoder. Experimental results are included.

متن کامل

Exact pattern matching: Adapting the Boyer-Moore algorithm for DNA searches

Exact pattern matching aims to locate all occurrences of a pattern in a text. Many algorithms have been proposed, but two algorithms, the Knuth-Morris-Pratt (KMP) and the Boyer-Moore (BM), are most widespread. It is the basis of some approximate string matching algorithms like BLAST, and in many cases it is desirable to locate an exact rather than approximate matches. Although several studies i...

متن کامل

Improving semistatic compression via phrase-based modeling

In the last years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a...

متن کامل

Occurrence and Substring Heuristics for i-Matching

We consider a version of pattern matching useful in processing large musical data: matching, which consists in finding matches which are -approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols , is measured as . We also consider -matching, where is a bound on the total sum of the diff...

متن کامل